Abstract

We propose a modular framework for multi- 
relational learning via tensor decomposition. 
In our learning setting, the training data con- 
tains multiple types of relationships among a 
set of objects, which we represent by a sparse 
three-mode tensor. The goal is to predict 
the values of the missing entries. To do so, 
we model each relationship as a function of 
a linear combination of latent factors. We 
learn this latent representation by comput- 
ing a low-rank tensor decomposition, using 
quasi-Newton optimization of a weighted ob- 
jective function. Sparsity in the observed 
data is captured by the weighted objective, 
leading to improved accuracy when training 
data is limited. Exploiting sparsity also im- 
proves efficiency, potentially up to an order of 
magnitude over unweighted approaches. In 
addition, our framework accommodates arbi- 
trary combinations of smooth, task-specific 
loss functions, making it better suited for 
learning different types of relations. For the 
typical cases of real-valued functions and bi- 
nary relations, we propose several loss func- 
tions and derive the associated parameter 
gradients. We evaluate our method on syn- 
thetic and real data, showing significant im- 
provements in both accuracy and scalability 
over related factorization techniques. 

1 Introduction 

In network or relational data, one often finds multiple 
types of relations on a set of objects. For instance, in 
social networks, relationships between individuals may 
be personal, familial, or professional. We refer to this 
type of data as multi-relational. In this paper, we pro- 
pose a tensor decomposition model for transduction on 



multi-relational data. We consider a scenario in which 
we are given a fixed set of objects, a set of relations 
and a small training set, sampled from the full set of all 
potential pairwise relationships; our goal is to predict 
the unobserved relationships. The relations we consider
may be binary-, discrete ordinal- or real-valued
functions of the object pairs; for the binary-valued re- 
lationships, the training labels include both positive 
and negative examples. 

There has been a growing interest in tensor methods 
within machine learning, partially due to their natu- 
ral representation of multi-relational data (Kashima 
et al., 2009). Many contributions (Dunlavy et al., 
2006, 2011; Gao et al., 2011; Xiong et al., 2010) use the 
canonical polyadic (CP) decomposition, a generaliza- 
tion of singular value decomposition to tensors. Others 
(Bader et al., 2007) have proposed models based on
decomposition into directional components (DEDICOM)
(Harshman, 1978). We propose a similar decomposition
(based on Nickel et al. (2011)) that is more
appropriate for multi-relational data, for reasons dis- 
cussed in Section 3.2. Unlike these previous methods, 
we do not attempt to decompose the input tensor di- 
rectly; rather, we explicitly model a mapping from the 
low-rank representation to the observed tensor, which 
is often better suited for prediction. For example, a 
binary relationship can be modeled as the sign of a 
latent representation; this gives the latent representa- 
tion more freedom to increase the prediction margin, 
rather than reproduce {±1} exactly. In this respect, 
approaches like maximum-margin matrix factorization 
(MMMF) (Srebro et al., 2005b; Rennie and Srebro, 
2005a) and DEDICOM can be viewed as specializa- 
tions of our framework. 

Our proposed method, multi-relational weighted ten- 
sor decomposition (see Section 3), assumes that the 
latent representation is determined by a linear com- 
bination of latent factors associated with each object. 
Learning these latent factors and their interactions in 
each relation thus becomes analogous to a weighted 
tensor decomposition (described in Section 3.1 and
illustrated in Figure 1). We formulate this decomposition
as a nonlinear optimization problem (Section 3.3),
which may incorporate any combination of smooth, 
task-specific loss functions. These task-specific loss 
functions allow simultaneous learning of various relation
types, such as binary- and continuous-valued. By
weighting the objective function, we are able to learn 
from limited observed (training) relationships without 
fitting the unobserved (testing) ones, improving both 
accuracy and efficiency. We demonstrate the effective- 
ness of our approach in Section 4, using both real and 
synthetic data experiments. Our results indicate that 
our approach is both more accurate and more efficient than
competing factorizations when training data is sparse. 

2 Preliminaries

This section introduces our notation and defines the 
problem of multi-relational transduction. 

We denote tensors and matrices using bold, uppercase 
letters; similarly, we use bold, lowercase letters to denote
vectors. For a tensor $\mathbf{X}$, let $x_{i,j,k}$ denote the
$(i,j)$-th element of the $k$-th frontal slice. Denote by $\mathbf{X}_k$
the matrix comprising the $k$-th frontal slice. We use $\odot$
to denote the Hadamard (i.e., element-wise) product,
$\mathrm{tr}(\cdot)$ for the trace operator and $\|\cdot\|_F$ for the Frobenius
norm. For a matrix $\mathbf{V}$ and function $f$, let $\nabla_{\mathbf{V}} f$ denote
the gradient of $f$ with respect to $\mathbf{V}$.

Fix a set of $m$ objects and a set of $n$ relations.^1 To
simplify our analysis, we assume that all relations
are symmetric, though one can obtain an analogous
derivation for asymmetric relations with only slightly
more work. We are given a partially observed tensor
$\mathbf{Y} \in \mathbb{R}^{m \times m \times n}$, in which each observed entry $y_{i,j,k}$ is
a (possibly noisy) measurement of a relationship and
each unobserved entry is set to a null value.^2 We
are additionally given a nonnegative weighting tensor
$\mathbf{W} \in \mathbb{R}_{+}^{m \times m \times n}$, where each entry $w_{i,j,k} \in [0, 1]$ corresponds
to a user-defined confidence, or certainty, in
the value of $y_{i,j,k}$; if $y_{i,j,k}$ is unobserved, then $w_{i,j,k}$
is necessarily zero. The goal of multi-relational transduction
in this tensor formulation is to infer the unobserved
entries in $\mathbf{Y}$.

3 Proposed Method 

This section introduces our proposed method, which
we refer to as multi-relational weighted tensor decomposition
(MrWTD). We begin by describing our low-rank
tensor representation of multi-relational data.
We then define an optimization objective used to compute
this representation and discuss how we solve the
optimization.

^1 Here, we use the term relation loosely to include not
only strict relations, for which relationships are either
present or not, but also real-valued functions.

^2 For example, for binary-valued relations in {±1}, the
null value is 0.

3.1 Representation as Tensor Decomposition 

Our fundamental assumption is that each relationship
is equal to a mapping $\Phi_k$ applied to an element $x_{i,j,k}$ in
an underlying low-rank tensor $\mathbf{X} \in \mathbb{R}^{m \times m \times n}$. Each $\Phi_k$
depends on the nature of the relation, and may differ
across relations. For example, for binary relations in
{±1}, $\Phi_k$ is the sign function. We further assume that
each $\mathbf{X}_k$ can be factored as a rank-$r$ decomposition

$$\mathbf{X}_k = \mathbf{A} \mathbf{R}_k \mathbf{A}^\top + b_k, \qquad (1)$$

where $\mathbf{A} \in \mathbb{R}^{m \times r}$, $\mathbf{R}_k \in \mathbb{R}^{r \times r}$ and $b_k \in \mathbb{R}$. (Figure 1
illustrates this decomposition.) Note that there is a
single $\mathbf{A}$ matrix, but $n$ instances of $\mathbf{R}_k$ and $b_k$. Also
note that we place no constraints on $\mathbf{A}$ or $\mathbf{R}_k$; the
columns of $\mathbf{A}$ need not be linearly independent, and
$\mathbf{R}_k$ need not be positive-semidefinite. To infer the values
of the missing (or uncertain) entries, we predict
each $y_{i,j,k}$ by computing $x_{i,j,k} = \mathbf{a}_i \mathbf{R}_k \mathbf{a}_j^\top + b_k$, where
$\mathbf{a}_i$ and $\mathbf{a}_j$ are the $i$-th and $j$-th row vectors of $\mathbf{A}$, and
then apply the appropriate mapping $\Phi_k(x_{i,j,k})$.

The entries of $\mathbf{A}$ can be interpreted as the global latent
factors of the objects, where the $i$-th row $\mathbf{a}_i$ corresponds
to the latent factors of object $i$. Each $\mathbf{R}_k$
determines the interactions of $\mathbf{A}$ in the $k$-th relation.
Thus, each predicted relationship comes from a linear 
combination of the objects' latent factors. Because 
the latent factors are global, information propagates 
between relations during the decomposition, thus en- 
abling collective learning. The addition of bk accounts 
for distributional bias within each relation. 
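
As an illustration of Equation 1, the following NumPy sketch computes a single latent entry and a full slice, then applies the sign mapping for a binary relation. It is a minimal sketch under assumptions: the function names, the toy sizes, and the slices-first storage of the interaction matrices (shape n x r x r) are hypothetical choices and are not taken from the paper.

```python
import numpy as np

def predict_entry(A, R, b, i, j, k):
    """Latent value x_{i,j,k} = a_i R_k a_j^T + b_k (Equation 1)."""
    return A[i] @ R[k] @ A[j] + b[k]

def reconstruct_slice(A, R, b, k):
    """Full k-th frontal slice X_k = A R_k A^T + b_k (bias added to every entry)."""
    return A @ R[k] @ A.T + b[k]

rng = np.random.default_rng(0)
m, n, r = 6, 2, 3                   # toy sizes (hypothetical)
A = rng.normal(size=(m, r))         # one shared factor matrix for all relations
R = rng.normal(size=(n, r, r))      # one interaction matrix per relation
b = rng.normal(size=n)              # one scalar bias per relation

X0 = reconstruct_slice(A, R, b, 0)
y_hat = np.sign(X0)                 # sign mapping for a binary relation in {-1, +1}
```

Because the same `A` is used for every `k`, changing a row of `A` changes the predictions for that object in all relations at once, which is the mechanism behind the collective learning described above.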

3.2 Related Models 

Our tensor model is comparable to Harshman's DEDICOM
(1978). Bader et al. (2007) applied the DEDICOM
model to the task of temporal link prediction (in
a single network), using the third mode as the time 
dimension. Recently, Nickel et al. (2011) proposed a 
relaxed DEDICOM, referred to as RESCAL, to solve 
several canonical multi-relational learning tasks. Of 
the previous approaches, our underlying decomposi- 
tion is most similar to RESCAL, and Equation 1 would 
be identical to the RESCAL decomposition if not for 
the bias term. Beyond the decomposition, the key dis- 
tinction is that RESCAL directly decomposes the in- 
put tensor, rather than modeling the mapping from X 
to Y. RESCAL also ignores the potential sparsity and 
uncertainty in the observations, whereas we explicitly
model this. We demonstrate in Section 4 that our
formulation produces more accurate predictions even
when observed (training) data is limited.

Figure 1: For $k = 1, \ldots, n$, each slice of the input tensor is approximated by a function of a low-rank
decomposition $\mathbf{A} \mathbf{R}_k \mathbf{A}^\top + b_k$. The latent factors $\mathbf{A}$ are common to all slices. Each $\mathbf{R}_k$ determines the interactions
of $\mathbf{A}$ in the $k$-th relation, while $b_k$ accounts for distributional bias.

Other tensor factorization models have been proposed 
for multi-relational data, though they typically use the 
CP decomposition (Dunlavy et al., 2006, 2011; Gao 
et al., 2011; Xiong et al., 2010). In the CP decomposi- 
tion, each entry is the inner product of three vectors; 
this would be similar to our decomposition if each
$\mathbf{R}_k$ slice were constrained to be diagonal. The richer interactions
of the relaxed DEDICOM and the global
latent representation of the objects often make it bet- 
ter suited for multi-relational learning, as was corrob- 
orated empirically by Nickel et al. (2011). 

3.3 Objective 

To compute the decomposition in Equation 1, we min- 
imize the following regularized objective: 

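$$f(\mathbf{A}, \mathbf{R}, \mathbf{b}) = \frac{\lambda}{2} \|\mathbf{A}\|_F^2 + \sum_{k=1}^{n} \left[ \frac{\lambda}{2} \|\mathbf{R}_k\|_F^2 + \mathrm{tr}\!\left( \mathbf{W}_k \left( \ell_k(\mathbf{Y}_k, \mathbf{X}_k) \right)^\top \right) \right], \qquad (2)$$

where $\lambda > 0$ is a regularization parameter, $\mathbf{X}_k$ is computed
by Equation 1, and $\ell_k$ is a loss function that
is applied element-wise to the $k$-th slice. (For brevity,
we use $f$ to denote $f(\mathbf{A}, \mathbf{R}, \mathbf{b})$.) This ability to combine
multiple loss functions is central to our approach,
as the appropriate penalty depends on the mapping
from each $\mathbf{X}_k$ to $\mathbf{Y}_k$. Though most matrix and tensor
decompositions focus on minimizing the quadratic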
loss (defined below), this criterion may not be opti- 
mal for certain prediction tasks (such as binary predic- 
tion). By explicitly making the loss function for each 
slice task-specific, our framework offers more flexibility 
than related techniques. The only requirement (due to 
our optimization method) is that the loss function is 
smooth. 

It is important to note our use of L2 regularization. 
Regularization effectively controls the complexity of 



the model and thereby reduces the possibility of over- 
fitting. This follows the traditional wisdom that "sim- 
pler" models will generalize better to unseen data — in 
this case, the unobserved tensor entries. The rank of 
the decomposition can also be seen as a complexity 
parameter, since higher ranks will better fit the ob- 
served data. However, after a certain point, increasing 
the rank has a diminishing effect, since the regularizer 
seeks to minimize the Frobenius norm of the decomposition.
We explore the effect of the rank parameter
empirically in Section 4.4. 

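To minimize Equation 2, we require the gradients of $f$
w.r.t. $\mathbf{A}$, $\mathbf{R}_k$ and $b_k$. Leveraging the symmetry of $\mathbf{R}_k$,
we derive^3 these as

$$\nabla_{\mathbf{A}} f = \lambda \mathbf{A} + \sum_{k=1}^{n} 2 \left( \mathbf{W}_k \odot \nabla_{\mathbf{X}_k} \ell_k(\mathbf{Y}_k, \mathbf{X}_k) \right) \mathbf{A} \mathbf{R}_k, \qquad (3)$$

$$\nabla_{\mathbf{R}_k} f = \lambda \mathbf{R}_k + \mathbf{A}^\top \left( \mathbf{W}_k \odot \nabla_{\mathbf{X}_k} \ell_k(\mathbf{Y}_k, \mathbf{X}_k) \right) \mathbf{A}, \qquad (4)$$

$$\nabla_{b_k} f = \mathrm{tr}\!\left( \mathbf{W}_k \left( \nabla_{\mathbf{X}_k} \ell_k(\mathbf{Y}_k, \mathbf{X}_k) \right)^\top \right), \qquad (5)$$

where $\odot$ denotes the Hadamard (i.e., element-wise)
product, and $\nabla_{\mathbf{X}_k} \ell_k(\mathbf{Y}_k, \mathbf{X}_k)$ is the gradient of $\ell_k$
w.r.t. $\mathbf{X}_k$. Though this accommodates any differentiable
loss function, we now present three that are applicable
to many relational problems, and derive their
corresponding loss gradients.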
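
To make Equations 2-5 concrete, the sketch below evaluates the objective and its gradients densely with NumPy. It is illustrative only and rests on stated assumptions: tensors are stored slices-first with shape (n, m, m), every slice of Y and W and every R_k is symmetric (which Equation 3 relies on), and the quadratic loss is used as the example loss. The function names are hypothetical.

```python
import numpy as np

def quad_loss(Y, X):
    """Element-wise quadratic loss 0.5 * (y - x)^2."""
    return 0.5 * (Y - X) ** 2

def quad_grad(Y, X):
    """Element-wise gradient of the quadratic loss w.r.t. x."""
    return X - Y

def objective_and_grads(A, R, b, Y, W, lam, loss=quad_loss, loss_grad=quad_grad):
    """Dense evaluation of Equation 2 and of Equations 3-5.

    Assumes symmetric Y[k], W[k] and R[k]; shapes are
    A: (m, r), R: (n, r, r), b: (n,), Y and W: (n, m, m).
    """
    f = 0.5 * lam * np.sum(A ** 2)
    gA = lam * A
    gR = np.zeros_like(R)
    gb = np.zeros_like(b)
    for k in range(len(R)):
        X = A @ R[k] @ A.T + b[k]                 # Equation 1
        G = W[k] * loss_grad(Y[k], X)             # W_k (Hadamard) loss gradient
        f += 0.5 * lam * np.sum(R[k] ** 2) + np.sum(W[k] * loss(Y[k], X))
        gA += 2.0 * G @ A @ R[k]                  # Equation 3 (needs symmetry)
        gR[k] = lam * R[k] + A.T @ G @ A          # Equation 4
        gb[k] = G.sum()                           # Equation 5, tr(W_k G^T)
    return f, gA, gR, gb

rng = np.random.default_rng(0)
m, n, r = 5, 2, 3
A = rng.normal(size=(m, r))
R = rng.normal(size=(n, r, r)); R = 0.5 * (R + R.transpose(0, 2, 1))
b = rng.normal(size=n)
Y = rng.normal(size=(n, m, m)); Y = 0.5 * (Y + Y.transpose(0, 2, 1))
W = (rng.random(size=(n, m, m)) < 0.5).astype(float)
W = np.maximum(W, W.transpose(0, 2, 1))           # symmetric observation mask
f, gA, gR, gb = objective_and_grads(A, R, b, Y, W, lam=0.1)
```

A central finite-difference comparison against `gA`, `gR` and `gb` is a cheap way to check a new loss gradient before handing it to an optimizer.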

Quadratic Loss: The most common loss function
used in matrix and tensor factorization is the quadratic
loss, which we denote by $\ell^{q}(y, x) = \frac{1}{2}(y - x)^2$. Minimizing
the quadratic loss corresponds to the setting
in which each relationship is directly approximated by
a linear combination of latent factors; i.e., $\Phi_k$ is the
identity and $\mathbf{Y}_k \approx \mathbf{X}_k$. For this loss function, the loss
gradient is simply $\nabla_{\mathbf{X}_k} \ell^{q}(\mathbf{Y}_k, \mathbf{X}_k) = (\mathbf{X}_k - \mathbf{Y}_k)$.

Smooth Hinge Loss: While the quadratic loss may
be appropriate for learning real-valued functions, it
is sometimes ill-suited for learning binary relations,
which are essentially binary classifications. For binary
classification, the goal is to complete a partially observed
slice $\mathbf{Y}_k \in \{\pm 1\}^{m \times m}$. Recall that the mapping $\Phi_k$
is the sign function, and so $y_{i,j,k} \approx \mathrm{sgn}(x_{i,j,k})$. Approximating
{±1} with a quadratic penalty may yield
a "small-margin" solution, since high-confidence predictions
will push low-confidence predictions closer to
the decision boundary. To get a "large-margin" solution,
we use the smooth hinge loss (Rennie and Srebro,
2005a), $\ell^{h}(y, x) = h(yx)$, where

$$h(z) = \begin{cases} 1/2 - z & \text{if } z < 0, \\ (1 - z)^2 / 2 & \text{if } 0 \le z < 1, \\ 0 & \text{if } z \ge 1. \end{cases}$$

Unlike the standard hinge loss, the smooth hinge is
differentiable everywhere. To obtain closed-form gradients,
we define tensors $\mathbf{P}, \mathbf{Q} \in \mathbb{R}^{m \times m \times n}$, where

$$p_{i,j,k} \triangleq \begin{cases} 1 & \text{if } 0 \le y_{i,j,k} x_{i,j,k} < 1, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$q_{i,j,k} \triangleq \begin{cases} 1 & \text{if } y_{i,j,k} x_{i,j,k} < 1, \\ 0 & \text{otherwise.} \end{cases}$$

We can therefore express the smooth hinge as

$$\ell^{h}(y_{i,j,k}, x_{i,j,k}) = \frac{1}{2} p_{i,j,k} (y_{i,j,k} x_{i,j,k})^2 + q_{i,j,k} \left( \frac{1}{2} - y_{i,j,k} x_{i,j,k} \right),$$

which we can differentiate w.r.t. $\mathbf{X}_k$ (using $y_{i,j,k}^2 = 1$) to obtain

$$\nabla_{\mathbf{X}_k} \ell^{h}(\mathbf{Y}_k, \mathbf{X}_k) = (\mathbf{P}_k \odot \mathbf{X}_k - \mathbf{Q}_k \odot \mathbf{Y}_k).$$

^3 Due to space restrictions, we state the gradients without
their derivation.
Logistic Loss: For binary relations, we can also
use the logistic loss (Rennie and Srebro, 2005b), defined
as $\ell^{l}(y, x) = \log(1 + e^{-yx})$. From a statistical
perspective, this corresponds to the negative conditional
log-likelihood of a logistic model. Note that
this loss function also maximizes the binary prediction
margin $yx$. The gradient of $\ell^{l}$ is easily derived
as $\nabla_{\mathbf{X}_k} \ell^{l}(\mathbf{Y}_k, \mathbf{X}_k) = -\mathbf{Y}_k \odot \mathbf{Z}_k$, where $z_{i,j,k} =
(1 + e^{y_{i,j,k} x_{i,j,k}})^{-1}$.
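
The three losses and their gradients can be written element-wise in a few lines of NumPy. The sketch below is not the authors' MATLAB code; the function names are hypothetical, inputs are assumed to be NumPy arrays, and the smooth hinge and logistic gradients assume labels in {-1, +1}. The logistic loss is computed with `np.logaddexp` for numerical stability, which is an implementation choice made here.

```python
import numpy as np

def quadratic(y, x):
    """Quadratic loss 0.5 * (y - x)^2 and its gradient x - y."""
    return 0.5 * (y - x) ** 2, x - y

def smooth_hinge(y, x):
    """Smooth hinge h(yx) and its gradient P*x - Q*y, for y in {-1, +1}."""
    z = y * x
    p = ((z >= 0) & (z < 1)).astype(float)   # indicator of 0 <= yx < 1
    q = (z < 1).astype(float)                # indicator of yx < 1
    loss = 0.5 * p * z ** 2 + q * (0.5 - z)
    grad = p * x - q * y
    return loss, grad

def logistic(y, x):
    """Logistic loss log(1 + exp(-yx)) and its gradient -y / (1 + exp(yx))."""
    z = y * x
    loss = np.logaddexp(0.0, -z)
    grad = -y * np.exp(-np.logaddexp(0.0, z))   # stable form of -y / (1 + e^z)
    return loss, grad

y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
x = np.array([-0.5, -0.3, 0.4, 2.0, 3.0])
hinge_loss, hinge_grad = smooth_hinge(y, x)
```

For the sample points, the margins yx are -0.5, 0.3, 0.4, -2.0 and 3.0, so the three branches of h are all exercised.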

3.4 Weighting and Efficiency 

The weighting tensor W is a particularly important 
component of our framework. Without W, the ob- 
jective function would place equal importance on fit- 
ting both observed and unobserved values. If the ob- 
served tensor is very sparse (as it often is in real train- 
ing data), this will result in fitting a large number 
of "phantom zeros". The weighting tensor prevents 
this from happening by emphasizing only the observed 
(or certain) entries. We can thus train on a small 
number of observations without fitting the unobserved 



data. This approach is similar to that of Acar et al. (2010),
though their analysis is limited to minimizing the
quadratic loss for a CP decomposition.

Weighting the objective by W also leads to an im- 
provement in efficiency. When W is sparse, the ob- 
jective and gradient calculations are fairly lightweight, 
because any expression involving W can be computed 
using sparse arithmetic. For instance, $\mathbf{X}_k$ only appears
in a Hadamard product with $\mathbf{W}_k$, so Equation 1 can
be implemented as a sparse outer product, where we
only compute $x_{i,j,k}$ for any nonzero $w_{i,j,k}$. In Equations
2-5, the only expressions that do not involve $\mathbf{W}$
are the regularization terms. Thus, when $\mathbf{W}$ has only
$c$ nonzero elements, the computational costs of these
equations are $O(ncr + nmr^2)$. In contrast, methods
that ignore the sparsity of the observed tensor take
$O(nm^2 r + nmr^2)$ time. Assuming that $m^2$ is the dominant
term and that $c$ grows much slower than $m^2$
(e.g., in natural networks, $c$ is often $O(m)$), the sparse
computation can be an order of magnitude faster.
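
The sparse evaluation described above can be sketched as follows: the product A R_k is formed once per slice, and the latent values are then computed only at the observed index pairs. This is a minimal NumPy sketch under assumptions; the function name and the representation of the nonzero weights as two index arrays are hypothetical.

```python
import numpy as np

def sparse_latent_values(A, Rk, bk, rows, cols):
    """Return x_{i,j,k} only for the index pairs (rows[t], cols[t]).

    Forming A @ Rk costs O(m r^2) once per slice; each requested entry
    then costs O(r), so c entries cost O(c r).
    """
    AR = A @ Rk
    return np.einsum("tr,tr->t", AR[rows], A[cols]) + bk

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))
Rk = rng.normal(size=(3, 3))
bk = 0.7
rows = np.array([0, 2, 5, 7])        # observed pairs (i, j) of one slice
cols = np.array([1, 2, 0, 6])
x_sparse = sparse_latent_values(A, Rk, bk, rows, cols)
```

The dense m x m slice is never formed, which is where the savings come from when the number of observed entries is far below m^2.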

Additionally, since W can be real-valued (not just 
{0,1}), we can adjust the entries to reduce the mis- 
take penalty of certain examples. For instance, sup- 
pose an incorrect negative prediction is deemed more 
critical than an incorrect positive (as is often the case 
in medical diagnoses and certain link prediction tasks). 
One could multiplicatively increase the values of all 
$\{w_{i,j,k} : y_{i,j,k} = 1\}$ or, alternatively, decrease the values
of all $\{w_{i,j,k} : y_{i,j,k} = -1\}$. This would effectively
penalize false negatives more severely than false posi- 
tives, encouraging the optimization to satisfy positive 
examples. 
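
One way to realize the second option (shrinking the weights of negative examples so that weights stay in [0, 1]) is sketched below. The helper name and the `factor` parameter are hypothetical and serve only as an illustration of the idea.

```python
import numpy as np

def downweight_negatives(W, Y, factor=0.5):
    """Return a copy of W with the weights of negative entries scaled by `factor`.

    Unobserved entries (weight 0) are unaffected; `factor` in (0, 1) makes
    mistakes on negative examples cheaper than mistakes on positive ones.
    """
    W_new = W.copy()
    W_new[Y == -1] *= factor
    return W_new

# Toy slice: 0 marks unobserved entries, +1 / -1 are observed labels.
Y = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, -1.0],
              [-1.0, -1.0, 0.0]])
W = (Y != 0).astype(float)           # weight 1 on every observed entry
W_adj = downweight_negatives(W, Y)
```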

3.5 Optimization 

To minimize the objective in Equation 2, we use 
limited-memory Broyden-Fletcher-Goldfarb-Shanno
(L-BFGS) optimization. Since quasi-Newton meth- 
ods, such as L-BFGS, avoid computing the Hessian, 
they are efficient for optimization problems involving 
many variables. All this requires is the objective 
function and the gradients in Equations 3-5. Since 
our optimization problem is non-convex, we are not 
guaranteed that L-BFGS will find the global mini- 
mum; in practice, however, the algorithm typically 
finds useful, though possibly local, minima. 
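
The optimization loop can be sketched with an off-the-shelf L-BFGS routine. The example below uses SciPy's `minimize` with `method="L-BFGS-B"`, which is a substitution made here for illustration (the paper's experiments use a MATLAB implementation). It assumes the quadratic loss, slices-first tensors of shape (n, m, m), a flat parameter vector, and toy data; all names are hypothetical. The gradient w.r.t. A is written in its general form, which reduces to Equation 3 when R_k and the weighted residual are symmetric.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(theta, m, n, r):
    """Split the flat parameter vector into A (m x r), R (n x r x r), b (n)."""
    A = theta[: m * r].reshape(m, r)
    R = theta[m * r : m * r + n * r * r].reshape(n, r, r)
    b = theta[m * r + n * r * r :]
    return A, R, b

def objective(theta, Y, W, lam, r):
    """Weighted quadratic-loss objective (Equation 2) and its flat gradient."""
    n, m, _ = Y.shape
    A, R, b = unpack(theta, m, n, r)
    f = 0.5 * lam * np.sum(A ** 2)
    gA, gR, gb = lam * A, np.empty_like(R), np.empty_like(b)
    for k in range(n):
        D = A @ R[k] @ A.T + b[k] - Y[k]          # residual X_k - Y_k
        G = W[k] * D                              # weighted loss gradient
        f += 0.5 * lam * np.sum(R[k] ** 2) + 0.5 * np.sum(G * D)
        gA = gA + G @ A @ R[k].T + G.T @ A @ R[k]
        gR[k] = lam * R[k] + A.T @ G @ A
        gb[k] = G.sum()
    return f, np.concatenate([gA.ravel(), gR.ravel(), gb])

rng = np.random.default_rng(0)
m, n, r, lam = 30, 2, 3, 0.1
A_true = rng.normal(size=(m, r))
Y = np.stack([A_true @ np.diag(rng.normal(size=r)) @ A_true.T for _ in range(n)])
W = (rng.random(size=(n, m, m)) < 0.3).astype(float)
W = np.maximum(W, W.transpose(0, 2, 1))           # symmetric observation mask

theta0 = 0.1 * rng.normal(size=m * r + n * r * r + n)   # random start
f0, _ = objective(theta0, Y, W, lam, r)
res = minimize(objective, theta0, args=(Y, W, lam, r), jac=True,
               method="L-BFGS-B", options={"maxiter": 200})
A_hat, R_hat, b_hat = unpack(res.x, m, n, r)
```

Passing `jac=True` tells SciPy that `objective` returns both the value and the gradient, so the objective and gradient share the per-slice work.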

To mitigate the possibility of finding local minima, we 
initialize the parameters using the eigendecomposition 
of each input slice, which is close to the desired factor- 
ization. This technique is similar to the initialization 
used by Bader et al. (2007) and Nickel et al. (2011), 
which have similar decompositions. For $k = 1, \ldots, n$,
let $\Lambda_k = (\lambda_{1,k}, \ldots, \lambda_{r,k})$ denote the $r$ largest eigenvalues
of $\mathbf{Y}_k$, and let $\mathbf{V}_k = (\mathbf{v}_{1,k}, \ldots, \mathbf{v}_{r,k})$ denote their
corresponding eigenvectors. We initialize $\mathbf{R}_k$ as a diagonal
matrix with $\Lambda_k$ along the diagonal, and $\mathbf{A}$ as
the average of $\mathbf{V}_1, \ldots, \mathbf{V}_n$. In practice, we find that
this initialization converges faster, and often to a bet- 
ter solution, than random initialization. 
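
The initialization above can be sketched with `numpy.linalg.eigh`. Several details are assumptions made for this illustration rather than statements from the paper: slices are stored with shape (n, m, m) and are symmetric, "largest" is taken to mean largest by value, and the biases are initialized to zero. The function name is hypothetical.

```python
import numpy as np

def eig_init(Y, r):
    """Initialize A, R and b from the top-r eigenpairs of each symmetric slice."""
    n, m, _ = Y.shape
    A = np.zeros((m, r))
    R = np.zeros((n, r, r))
    for k in range(n):
        vals, vecs = np.linalg.eigh(Y[k])      # eigenvalues in ascending order
        top = np.argsort(vals)[::-1][:r]       # indices of the r largest
        R[k] = np.diag(vals[top])              # R_k = diag(Lambda_k)
        A += vecs[:, top] / n                  # A = average of V_1, ..., V_n
    return A, R, np.zeros(n)                   # zero bias is an assumption

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 2))
Y = (B @ B.T)[None]                            # one symmetric slice of rank 2
A0, R0, b0 = eig_init(Y, r=2)
```

With a single slice whose rank equals r, the initialization already reproduces the slice; with several slices, eigenvectors are only defined up to sign, so the average is a heuristic starting point rather than an exact factorization.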

Note that when the objective function uses only 
quadratic loss, one can compute the parameter up- 
dates using the alternating simultaneous approxima- 
tion, least squares and Newton (ASALSAN) algorithm 
(Bader et al., 2007), which produces an approximate 
solution and has been shown to converge quickly. Since 
our objective may contain a heterogeneous mixture of 
loss functions — not all necessarily quadratic — we do 
not use ASALSAN. Moreover, we cannot use traditional
convex programming techniques like semidefinite pro- 
gramming (SDP) because our objective is non-convex. 

4 Experiments 

In this section, we compare variants of MrWTD with 
RESCAL (Nickel et al., 2011), MMMF (Rennie and 
Srebro, 2005a) and Bayesian probabilistic tensor fac- 
torization (BPTF) (Xiong et al., 2010) in several ex- 
periments, using both real and synthetic data. The 
real data sources are kinship data from the Australian 
Alyawarra tribe, and two social interaction datasets 
from the MIT Media Lab. The comparisons highlight 
the critical advantages of MrWTD: namely, the ability 
to learn from limited training data, handle a mixture 
of learning objectives, transfer information across re- 
lations for collective learning, and exploit sparsity for 
improved efficiency.

To test the effect of the rank parameter, we run an ex- 
periment varying only the rank of the decomposition 
over a range of values. The results support our hy- 
pothesis that L2 regularization reduces the impact of 
the rank, effectively controlling the model complexity. 

Finally, we perform a synthetic experiment to compare 
the running time of MrWTD to that of the above com- 
peting methods, demonstrating the significant scala- 
bility gains provided by exploiting sparsity. 

To conserve space, certain figures and tables are pro- 
vided in the supplementary material (Appendix A). 

4.1 Compared Methods 

To evaluate the performance of various loss functions, 
we compare several variants of MrWTD. The vari- 
ant named MrWTD-Q uses the quadratic loss for all 
relations, regardless of their type. MrWTD-H and 
MrWTD-L use the quadratic loss for real-valued slices 
and the smooth hinge or logistic loss, respectively, for 
binary slices. 



The RESCAL model approximates each slice of the input
tensor as $\mathbf{Y}_k \approx \mathbf{A} \mathbf{R}_k \mathbf{A}^\top$. In (Nickel et al., 2011),
binary relationships are represented by {0, 1}. Unfortunately,
since RESCAL does not account for missing
data, unobserved relationships are simply treated
as negative examples. In order to distinguish between
(un)observed relationships and negative examples,
we use {±1} for observed data and zeros elsewhere.
In our experiments, we find that this modification
improves RESCAL's performance over the original
method. Since RESCAL uses the quadratic loss
uniformly, it uses ASALSAN to compute the decomposition,
with L2 regularization on $\mathbf{A}$ and $\mathbf{R}_k$.

MMMF is a tool for matrix reconstruction and, as 
such, is not designed for multi-relational data. That 
said, we can use it to reconstruct each slice of the ten- 
sor individually. Like MrWTD, MMMF approximates 
a binary input Y using the sign of a rank-r matrix 
decomposition, $\mathbf{Y} \approx \mathrm{sgn}(\mathbf{U}\mathbf{V}^\top)$, where $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{m \times r}$.
The "fast" variant of the algorithm (Rennie and Sre- 
bro, 2005a) adds a bias term and uses the smooth hinge 
loss. The optimization objective is very similar to ours, 
but with different gradients, due to the decomposition. 
Our implementation of fast MMMF differs from that 
of Rennie and Srebro (2005a) only in the way we solve 
the optimization (using L-BFGS, rather than conju- 
gate gradient descent) and the fact that the input is 
assumed to be symmetric. 

Because the synthetic data generator (described in 
Section 4.2) matches our decomposition and is slightly 
different than that of traditional MMMF, it is 
somewhat unfair to compare traditional MMMF to 
MrWTD. We therefore run a variant of MrWTD that 
decomposes each slice separately instead of jointly, using
a separate $\mathbf{A}_k$. This is meant to equalize the discrepancy
in the decomposition, while isolating the deficiencies
of non-collective learning. We refer to this
model as MMMF+.

BPTF is a fully Bayesian interpretation of the CP tensor
factorization, originally designed for temporal prediction.^4
We compare it to MrWTD to investigate
the benefits and drawbacks of the Bayesian approach.
BPTF assumes that all latent factors are sampled from
a Gaussian distribution. The only user-defined properties
are the rank of the decomposition and the hyperparameters.
The parameters and latent factors are
estimated using Gibbs sampling. One benefit of the
Bayesian approach is that it avoids the model selection
problem, which in our case is the choice of regularization
parameters.^5 This can have a pronounced effect
when training data is limited, which makes proper regularization
critical. However, Gibbs sampling is computationally
expensive, since it requires many iterations
of sampling to converge to an accurate estimate.
We analyze this trade-off between accuracy and efficiency
in Section 4.5. Additionally, BPTF only supports
the quadratic loss, since it has a natural probabilistic
interpretation as the Gaussian likelihood and
makes the model conjugate, making Gibbs sampling
easier. No such interpretation exists for the (smooth)
hinge loss, and the logistic loss has no conjugate prior.

^4 Sutskever et al. (2009) propose another fully Bayesian
algorithm, Bayesian tensor factorization (BTF), whose decomposition
is very similar to ours, though their framework
only supports the quadratic loss. We were unable to compare
MrWTD to this method at the time of submission.
We implement all of the above methods in MATLAB,
using a third-party implementation of L-BFGS^6, and
the authors' implementation of BPTF^7.

4.2 Synthetic Data Experiments 

To generate the synthetic data, we start by computing
a low-rank tensor $\mathbf{X} \in \mathbb{R}^{m \times m \times n}$ as $\mathbf{X}_k = \mathbf{A} \mathbf{R}_k \mathbf{A}^\top +
\mathbf{E}_k$, for $k = 1, \ldots, n$, where $\mathbf{A} \in \mathbb{R}^{m \times r}$ and $\mathbf{R}_k \in \mathbb{R}^{r \times r}$
are sampled from a normal distribution, and $\mathbf{E}_k \in
\mathbb{R}^{m \times m}$ is low-level, normally-distributed noise. For the
first experiment, we construct $n = 3$ binary relations
(i.e., slices), over $m = 500$ objects, using rank $r =
10$. We refer to this dataset as Binary Synthetic. To
generate a binary tensor $\mathbf{Y} \in \{\pm 1\}^{m \times m \times n}$, we round
the values of $\mathbf{X}$ using the 90th percentile of its values
as a threshold. This produces a heavy skew towards 
the negative class, as is typical in real multi-relational 
data. For the second experiment, we construct one 
binary relation and one real-valued relation, again over
500 objects, with rank 10. We normalize the real- 
valued relation such that the standard deviation is 1.0, 
giving it roughly the same scale as the binary slices. 
We refer to this dataset as Mixed Synthetic. 
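
A generator along the lines of the Binary Synthetic construction might look like the sketch below. Several choices are assumptions made here and are not specified in the text: the noise scale, the explicit symmetrization of each slice (to match the symmetric-relation assumption of Section 2), the slices-first storage, and the function and parameter names.

```python
import numpy as np

def make_binary_synthetic(m=500, n=3, r=10, noise=0.1, pct=90, seed=0):
    """Return (X, Y): a noisy low-rank tensor and its thresholded {-1, +1} version."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(m, r))                   # shared latent factors
    X = np.empty((n, m, m))
    for k in range(n):
        Rk = rng.normal(size=(r, r))
        E = rng.normal(scale=noise, size=(m, m))  # low-level Gaussian noise
        Xk = A @ Rk @ A.T + E
        X[k] = 0.5 * (Xk + Xk.T)                  # assumption: enforce symmetry
    thresh = np.percentile(X, pct)                # 90th percentile by default
    Y = np.where(X > thresh, 1.0, -1.0)           # roughly 10% positives
    return X, Y

X, Y = make_binary_synthetic(m=60, n=3, r=5, seed=0)   # small instance for a quick run
```

Thresholding at the 90th percentile of all values leaves about one entry in ten positive, which gives the skew toward the negative class described above.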

We evaluate over training sizes $t \in [3, 25]$ percent, averaging
the results over 20 runs per size. In each run,
we sample a random $t \cdot \binom{m}{2}$ pairs (and their symmetric
counterparts) from each slice to use as the training set, 
and let the remaining pairs comprise the test set. We 
then hold out a random 25% from the training set as 
a validation set for a regularization parameter search, 
where we search over the range [10^"^, 10^] in logarith- 
mic increments. For Binary Synthetic, we select the 
optimal parameter $\lambda^*$ that maximizes the area under
the precision-recall curve (AUPRC), averaged over all
slices; for Mixed Synthetic, we maximize the harmonic
mean of the AUPRC of the first slice and one minus
the mean-squared error (MSE) of the second. We then
retrain on the full training set using $\lambda^*$ and evaluate
on the test set. For BPTF, we run Gibbs sampling for
200 iterations.

^5 As the authors claim, the effect of tuning the hyperparameter
priors is minimal.

^6 www.di.ens.fr/~mschmidt/Software/minFunc.html

^7 www.cs.cmu.edu/~lxiong/bptf/bptf.html

The results of the synthetic data experiments are given
in Figure 2, reported as average AUPRC and MSE over
20 runs. On Binary Synthetic, MrWTD-L achieves a
statistically significant^8 lift over the competing methods
for training sizes 5% and up, and all three variants
show significant lift for 10% and above. We
attribute these results to two primary advantages: the
weighted objective function, with its mixture of task-specific
loss functions, and the global latent factors. As
discussed in Section 3.4, the weighted objective is necessary
for exploiting small amounts of observed (i.e.,
training) data, without fitting the unobserved entries.
Since RESCAL treats all entries as observed, it tends
to fit the unobserved entries in sparsely populated tensors.
Furthermore, though MMMF and MMMF+ use
the same large-margin technique as MrWTD-H and
MrWTD-L, they do not perform collective learning,
since the latent factors are specific to each slice. In
MrWTD, information from one slice is propagated to
the others via the global latent factors. Note that
BPTF and MrWTD-Q perform significantly worse than the
large-margin loss variants of MrWTD for sizes 10%
and above, illustrating that the quadratic loss is not
always appropriate for binary data. On Mixed Synthetic,
MrWTD's improvement over RESCAL and
MMMF, for both slices, is statistically significant
for all training sizes. MMMF+ is competitive with
MrWTD on the real-valued slice, with significant lift
for training sizes 3% and 5%; yet its performance deteriorates
on the binary slice, for all training sizes, since
it is not able to transfer information between slices.
BPTF is also competitive with MrWTD on the real-valued
slice, and the binary slice for smaller training
sizes, but falls slightly behind on the higher sizes.

4.3 Real Data Experiments 

We evaluate on several real multi-relational datasets. 
The first dataset consists of kinship data from the Aus- 
tralian Alyawarra tribe, as recorded by Denham and 
White (2005). This data has previously been used 
by Kemp et al. (2006) for multi-relational link pre- 
diction. The data contains m = 104 tribe members 
and $n = 23$ types of kinship (binary) relations.^9 In
total, the dataset includes 125,580 related pairs. This
yields a tensor $\mathbf{Y} \in \{\pm 1\}^{104 \times 104 \times 23}$.

^8 We measure statistical significance in all experiments
using a 2-sample t-test with rejection threshold 0.05.

^9 The original data contains 26 relations, but relations
24-26 are extremely sparse, exhibiting fewer than 6 instances,
so we omit them.

Figure 2: Results of the synthetic data experiments (discussed in Section 4.2). Panels: (a) Binary Synthetic; (b) Mixed Synthetic (binary slice); (c) Mixed Synthetic (real slice). The horizontal axis indicates the
size of the training data (in percentage of the tensor); in (a) and (b), the vertical axis indicates the area under
the precision-recall curve; in (c), the vertical axis shows the mean squared error (MSE). All scores are averaged
over 20 runs. The top to bottom arrangement of the legend corresponds to a left to right arrangement in each
group. See Table 2 in the appendix for the full results, including standard deviations.

The remaining datasets come from MIT's Human Dynamics
Laboratory. Both consist of human interaction
data from students, faculty, and staff working on
the MIT campus, recorded by a mobile phone application.
From the first dataset, named Reality Mining
(Eagle et al., 2009), we use the survey-annotated network,
consisting of $n = 3$ types of binary relationships
annotated by the subjects: friendship, in-lab interaction
and out-of-lab interaction. These relationships
are measured between $m = 94$ participants, providing
a total of 13,395 related pairs. The resulting tensor
is $\mathbf{Y} \in \{\pm 1\}^{94 \times 94 \times 3}$. From the second dataset,
named Social Evolution (Dong et al., 2011), we use
the survey-annotated network, as well as several interaction
relations derived from sensor data, resulting in
$n = 8$ binary relations with 16,101 related pairs. The
five surveyed relations are: close friendship, biweekly
social interaction, political discussion, and two types of social
media interaction. The three derived relations are
computed from: voice calls, SMS messaging and proximity.
We binarize this data by a simple indicator of
whether the given type of interaction occurred. In this
case, the number of users is $m = 84$, resulting in a tensor
$\mathbf{Y} \in \{\pm 1\}^{84 \times 84 \times 8}$.

For these experiments, we use the same methodology 
as the synthetic experiments, with rank r = 20. The 
results are given in Figure 3. The three variants
of MrWTD and BPTF achieve significant lift over
RESCAL, MMMF and MMMF+ in nearly all experiments.
MrWTD has a statistically significant advantage
over the other methods for most training ratios
on the Kinship data, while BPTF has an advantage
on the Social Evolution data. Yet, as we show in the
following section, MrWTD's estimation takes a small
fraction of BPTF's running time. We therefore achieve
results that are comparable to Bayesian methods in far
less time. We refer the reader to Table 1 in the appendix
for the complete set of results.

4.4 Rank Experiment 

To measure the effect of the rank parameter on the 
performance of each algorithm, we rerun the Social 
Evolution experiment, varying $r \in \{5, 10, 20, 40\}$ and
keeping the training ratio fixed at 25%. The results
of this experiment are displayed in Figure 5, in
the appendix. There is a small increase in AUC from
r = 5 to r = 10, which is expected, since 5 is relatively 
low. However, we find that the effect of the rank is 
minimal for r > 10; the standard deviation across all 
runs in this range is < 0.02 for each algorithm. This 
supports our hypothesis that, beyond a certain thresh- 
old, the regularizer is the primary controller of model 
complexity. 

4.5 Timing Experiment 

Finally, we measure the running time of each of the 
above tensor methods to better understand their scal- 
ability in scenarios where training data is limited. We 
create a sequence of synthetic datasets (using the technique
in Section 4.2), each with n = 3 binary slices,
for sizes m ∈ {500, 1000, 2000, 4000, 8000}. For training,
we use a random 10% of the tensor. We compare
the smooth hinge loss variant of MrWTD, RESCAL 
and BPTF, using predefined regularization and hyper- 
parameters. We run these experiments on a machine 
with two 6-core Intel® Xeon® X5650 processors, run- 
ning at 2.66 GHz, and 48 GB of RAM. 
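The protocol above (a random 10% of the tensor for training, running time averaged over repeated runs) can be sketched as follows. This is an illustrative harness under our own assumptions, not the code used for the reported timings; `sample_training_indices`, `time_runs` and the `fit` callable are hypothetical names.

```python
import time
import numpy as np

def sample_training_indices(shape, ratio, rng):
    """Draw a uniform random subset of tensor entries, without replacement."""
    total = int(np.prod(shape))
    n_train = int(round(ratio * total))
    flat = rng.choice(total, size=n_train, replace=False)
    # Convert flat positions to (i, j, k) index arrays.
    return np.unravel_index(flat, shape)

def time_runs(fit, n_runs=10):
    """Return the mean and standard deviation of wall-clock time over n_runs."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fit()
        durations.append(time.perf_counter() - start)
    return float(np.mean(durations)), float(np.std(durations))

# Small example: m = 50 objects, n = 3 slices, 10% of entries for training.
rng = np.random.default_rng(0)
shape = (50, 50, 3)
train_idx = sample_training_indices(shape, 0.10, rng)
```

In a real comparison, `fit` would wrap the estimation routine of each method on the same sampled entries.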

Figure 3: Results of the real data experiments (discussed in Section 4.3) on (a) Kinship, (b) Reality Mining and
(c) Social Evolution. The horizontal axis indicates the size of the training data (in percentage of the tensor);
the vertical axis indicates the area under the precision-recall curve, averaged over 20 runs. The top to bottom
arrangement of the legend corresponds to a left to right arrangement in each group. See Table 3 in the appendix
for the full results, including standard deviations.

The timing results, averaged over 10 runs per problem
size, are shown in Figure 4. BPTF takes considerably
more time than the others, due to its Gibbs sampling
estimation. Note that we could not run BPTF on the
two largest problem sizes, due to out-of-memory exceptions.
This illustrates the tradeoff between accuracy
and efficiency in using Bayesian methods; one can
reduce running time by reducing the number of iterations,
but this would also affect the accuracy of the
estimation. Due to the efficient, closed-form updates
of the ASALSAN algorithm, RESCAL is the fastest
for small problem sizes. However, MrWTD is significantly
faster as the problem size grows. This is because
RESCAL's objective function treats all tensor entries
with equal importance, whereas MrWTD's weighted
objective only requires the predictions of the observed
entries, thus allowing us to skip prediction on the test
data during estimation.
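To make the last point concrete, the sketch below evaluates a weighted quadratic loss in two ways: by predicting only the observed entries, and by reconstructing the full tensor and masking it. It assumes a bilinear prediction of the form a_i^T R_k a_j with shared object factors A and per-relation matrices R_k, the identity link and 0/1 weights; these modeling details and all names are our assumptions for illustration.

```python
import numpy as np

def predict_observed(A, R, idx):
    """Predict only the entries (i, j, k) listed in idx."""
    i, j, k = idx
    # For each observed entry: sum over p, q of A[i, p] * R[k, p, q] * A[j, q].
    return np.einsum("np,npq,nq->n", A[i], R[k], A[j])

def observed_quadratic_loss(A, R, Y, idx):
    """Quadratic loss summed over observed entries only."""
    residual = Y[idx] - predict_observed(A, R, idx)
    return 0.5 * float(residual @ residual)

def dense_weighted_loss(A, R, Y, W):
    """Reference version: reconstruct all m x m x n entries, then apply W."""
    full = np.einsum("ip,kpq,jq->ijk", A, R, A)
    return 0.5 * float((W * (Y - full) ** 2).sum())

rng = np.random.default_rng(0)
m, n, r = 6, 2, 3
A = rng.normal(size=(m, r))
R = rng.normal(size=(n, r, r))
Y = rng.choice([-1.0, 1.0], size=(m, m, n))

# Ten distinct observed entries and the matching 0/1 weight tensor.
flat = rng.choice(m * m * n, size=10, replace=False)
idx = np.unravel_index(flat, (m, m, n))
W = np.zeros((m, m, n))
W[idx] = 1.0

sparse_loss = observed_quadratic_loss(A, R, Y, idx)
dense_loss = dense_weighted_loss(A, R, Y, W)
```

Both functions return the same value, but the first touches only the observed entries, so its cost grows with the number of training entries rather than with the full tensor size.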

5 Conclusion 

In this paper, we present a modular framework for 
multi-relational learning via tensor decomposition. 
The decomposition we use provides an intuitive 
interpretation for the multi-relational domain, where 
objects have global latent representations and rela- 
tionships are determined by a function of their linear 
combinations. We show that the global latent rep- 
resentations enable information to transfer between 
relation types during model estimation. Further, we 
demonstrate that our framework's weighted objective
and support for multiple loss functions improve
accuracy over similar models. Finally, we show that
our method exploits the sparsity of limited training 
data to achieve an order of magnitude speedup over 
unweighted methods. 

Figure 4: Results of the timing experiment (discussed
in Section 4.5). The horizontal axis indicates the size
of a synthetic dataset, measured by the number of objects
m; the vertical axis indicates the running time
(in seconds), averaged over 10 runs. For each dataset,
we use 10% for training. Note that BPTF could not
run on sizes {4000, 8000} due to runtime exceptions.

We plan to extend MrWTD to learn from
large-scale data by adapting hashing methods from the
matrix factorization literature (Karatzoglou et al.,
2010). We would also like to compare our method
to Sutskever et al.'s BCTF algorithm (2009), to further
investigate the benefit of the Bayesian approach. We
also intend to analyze the theoretical properties of our
framework, such as generalization error, using existing
learning theory literature (Srebro et al., 2005a; Cortes
et al., 2008; El-Yaniv and Pechyony, 2009).

Acknowledgements 

This work was partially supported by NSF CAREER 
grant 0746930 and NSF grant IIS1218488. 



London, Rekatsinas, Huang, Getoor 



References 

E. Acar, D. Dunlavy, T. Kolda, and M. Mørup. Scalable
tensor factorizations with missing data. In
Proc. of the 2010 SIAM International Conf. on Data
Mining (SDM), 2010.

B. Bader, R. Harshman, and T. Kolda. Temporal anal- 
ysis of semantic graphs using ASALSAN. In Proc. of 
the 7th IEEE International Conf. on Data Mining 
(ICDM), 2007. 

C. Cortes, M. Mohri, D. Pechyony, and A. Rastogi.
Stability of transductive regression algorithms. In
Proc. of the 25th International Conf. on Machine
Learning (ICML), 2008.

W. Denham and D. White. Multiple measures of 
Alyawarra kinship. Field Methods, 17(1), 2005. 

W. Dong, B. Lepri, and A. Pentland. Modeling the
co-evolution of behaviors and social relationships using
mobile phone data. In Proceedings of the 10th International
Conference on Mobile and Ubiquitous Multimedia,
MUM '11, pages 134-143, New York, NY,
USA, 2011. ACM.

D. Dunlavy, T. Kolda, and W. Kegelmeyer. Multilin- 
ear algebra for analyzing data with multiple link- 
ages. Technical Report, 2006. 

D. Dunlavy, T. Kolda, and E. Acar. Temporal link
prediction using matrix and tensor factorizations.
ACM Trans. on Knowledge Discovery from Data, 5
(2), 2011.

N. Eagle, A. Pentland, and D. Lazer. Inferring friend- 
ship network structure by using mobile phone data. 
Proc. of the National Academy of Sciences, 106(36), 
2009. 

R. El-Yaniv and D. Pechyony. Transductive
Rademacher complexity and its applications. J. Artificial
Intelligence Research (JAIR), 35, 2009.

S. Gao, L. Denoyer, and P. Gallinari. Link pat- 
tern prediction with tensor decomposition in multi- 
relational networks. In IEEE Symposium on Comp. 
Intell. and Data Mining, 2011. 

R. Harshman. Models for analysis of asymmetrical 
relationships. In First Joint Meeting of the Psy- 
chometric Society and the Society for Mathematical 
Psychology, 1978. 

A. Karatzoglou, A. Smola, and M. Weimer. Collaborative
filtering on a budget. In Proc. of the 13th International
Conf. on Artificial Intelligence and Statistics,
2010.

H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, 
and K. Tsuda. Link propagation: a fast semi- 
supervised learning algorithm for link prediction. 
In SIAM International Conference on Data Mining 
(SDM), 2009. 



C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and
N. Ueda. Learning systems of concepts with an infinite
relational model. In Proc. of the 21st National
Conf. on Artificial Intelligence, 2006.

M. Nickel, V. Tresp, and H. Kriegel. A three-way 
model for collective learning on multi-relational 
data. In Proc. of the 28th International Conf. on 
Machine Learning (ICML), 2011. 

J. Rennie and N. Srebro. Fast maximum margin matrix
factorization for collaborative prediction. In
Proc. of the 22nd International Conf. on Machine
Learning (ICML), 2005a.

J. Rennie and N. Srebro. Loss functions for preference
levels: regression with discrete ordered labels. In
IJCAI Multidisciplinary Workshop on Adv. in Preference
Handling, 2005b.

N. Srebro, N. Alon, and T. Jaakkola. Generalization 
error bounds for collaborative prediction with low- 
rank matrices. In Advances in Neural Information 
Processing Systems 17. 2005a. 

N. Srebro, J. Rennie, and T. Jaakkola. Maximum- 
margin matrix factorization. In Advances in Neural 
Information Processing Systems 17. 2005b. 

I. Sutskever, R. Salakhutdinov, and J. Tenenbaum.
Modelling relational data using Bayesian clustered
tensor factorization. In Advances in Neural Information
Processing Systems 22, pages 1821-1828. 2009.

L. Xiong, X. Chen, T. Huang, J. Schneider, and 
J. Carbonell. Temporal collaborative filtering with 
Bayesian probabilistic tensor factorization. In SIAM 
International Conference on Data Mining (SDM), 
2010. 



Multi-relational Learning Using Weighted Tensor Decomposition 



A Supplementary Material 

Here we report additional results from the experiments
discussed in Section 4. Figure 5 shows the results of
the rank experiment (Section 4.4). Table 1 shows the
results of the timing experiment (Section 4.5). In
Table 2 and Table 3, we list the full results of the
synthetic (Section 4.2) and real data (Section 4.3)
experiments.

Figure 5: Results of the rank experiment (discussed
in Section 4.4), using the Social Evolution
dataset. The horizontal axis indicates the rank of the
decomposition; the vertical axis indicates the area under
the precision-recall curve, averaged over 20 runs.
We use a random 25% of the tensor for training data
on each run. There is a slight increase from r = 5
to r = 10, but less than 0.02 standard deviation (per
algorithm) for r > 10, supporting our hypothesis that
regularization controls model complexity.



Table 1: Results of the timing experiment (discussed 
in Section 4.5). The first column indicates the size of 
a synthetic dataset, measured by the number of ob- 
jects m; the remaining columns indicate the running 
time (in seconds), averaged over 10 runs, with the as- 
sociated standard deviations. For each dataset, we use 
10% for training. Note that BPTF could not run on 
sizes {4000, 8000} due to runtime exceptions. 



m      RESCAL           BPTF           MrWTD-H
500    0.21 (0.05)      1.52 (0.05)    0.59 (0.01)
1000   0.67 (0.10)      6.51 (0.42)    1.35 (0.37)
2000   3.39 (0.35)      25.47 (1.57)   5.25 (0.05)
4000   32.96 (5.02)     (-)            20.33 (0.80)
8000   332.77 (40.88)   (-)            80.18 (2.44)
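
As a quick check on the scaling trend, the mean running times in Table 1 can be turned into RESCAL-to-MrWTD-H ratios. The snippet below is only arithmetic on the reported means (standard deviations are ignored); the variable names are ours.

```python
# Mean running times in seconds, copied from Table 1.
rescal = {500: 0.21, 1000: 0.67, 2000: 3.39, 4000: 32.96, 8000: 332.77}
mrwtd_h = {500: 0.59, 1000: 1.35, 2000: 5.25, 4000: 20.33, 8000: 80.18}

# Ratio above 1 means MrWTD-H is faster than RESCAL at that problem size.
speedup = {m: rescal[m] / mrwtd_h[m] for m in rescal}

for m, s in speedup.items():
    print(f"m = {m:>4}: RESCAL / MrWTD-H = {s:.2f}")
```

On these numbers, RESCAL is faster up to m = 2000, while MrWTD-H is roughly 1.6 times faster at m = 4000 and roughly 4 times faster at m = 8000.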

Table 2: Results of the synthetic data experiments (discussed in Section 4.2). The first column indicates the
dataset and score type. (For Mixed Synthetic, we provide two row groups to display the scores of the binary-
and real-valued slices.) The second column indicates the amount of training data (in percentage of the tensor).
We evaluate three variants of MrWTD: -Q uses the quadratic loss for all slices; -H and -L use the quadratic
loss for real-valued slices, but use the smooth hinge and logistic losses, respectively, for binary slices. For Binary
Synthetic and the first slice of Mixed Synthetic, we report area under the precision-recall curve (AUPRC); for
the second slice of Mixed Synthetic, we report mean squared error (MSE). Standard deviations are listed in
parentheses. All scores are averaged over 20 runs. Bold scores are statistically tied with the best score in each
row.



Data                  Tr%  RESCAL      MMMF        MMMF+       BPTF        MrWTD-Q     MrWTD-H     MrWTD-L

Binary Synth (AUPRC)    3  0.14 (.00)  0.13 (.04)  0.13 (.00)  0.15 (.01)  0.13 (.01)  0.12 (.01)  0.13 (.01)
                        5  0.16 (.01)  0.16 (.03)  0.17 (.01)  0.27 (.03)  0.23 (.03)  0.20 (.08)  0.32 (.03)
                       10  0.21 (.03)  0.24 (.04)  0.36 (.01)  0.50 (.01)  0.55 (.02)  0.70 (.02)  0.71 (.02)
                       15  0.38 (.03)  0.33 (.01)  0.51 (.01)  0.59 (.01)  0.66 (.02)  0.83 (.00)  0.84 (.00)
                       20  0.49 (.02)  0.37 (.01)  0.61 (.01)  0.63 (.01)  0.70 (.02)  0.89 (.00)  0.90 (.00)
                       25  0.57 (.02)  0.40 (.01)  0.69 (.01)  0.66 (.01)  0.73 (.02)  0.92 (.00)  0.93 (.00)

Mixed Synth (AUPRC)     3  0.14 (.01)  0.11 (.03)  0.13 (.01)  0.26 (.02)  0.27 (.03)  0.27 (.03)  0.26 (.04)
                        5  0.18 (.01)  0.13 (.01)  0.18 (.03)  0.37 (.02)  0.36 (.03)  0.37 (.03)  0.36 (.04)
                       10  0.27 (.02)  0.30 (.02)  0.35 (.03)  0.57 (.02)  0.55 (.03)  0.55 (.03)  0.54 (.04)
                       15  0.33 (.01)  0.25 (.08)  0.48 (.01)  0.65 (.01)  0.67 (.04)  0.67 (.02)  0.67 (.02)
                       20  0.39 (.02)  0.20 (.03)  0.57 (.02)  0.70 (.01)  0.74 (.03)  0.74 (.02)  0.75 (.02)
                       25  0.43 (.01)  0.24 (.06)  0.66 (.01)  0.71 (.01)  0.74 (.02)  0.74 (.02)  0.76 (.02)

Mixed Synth (MSE)       3  0.99 (.01)  0.99 (.01)  0.43 (.08)  0.48 (.03)  0.44 (.04)  0.44 (.04)  0.45 (.03)
                        5  0.97 (.01)  0.96 (.02)  0.12 (.07)  0.24 (.02)  0.19 (.02)  0.19 (.01)  0.19 (.01)
                       10  0.90 (.00)  0.92 (.01)  0.03 (.01)  0.06 (.01)  0.05 (.04)  0.04 (.00)  0.05 (.04)
                       15  0.83 (.00)  0.90 (.02)  0.02 (.00)  0.03 (.00)  0.03 (.02)  0.03 (.02)  0.04 (.02)
                       20  0.77 (.01)  0.84 (.00)  0.02 (.01)  0.02 (.00)  0.02 (.01)  0.03 (.01)  0.03 (.01)
                       25  0.70 (.01)  0.81 (.00)  0.01 (.00)  0.01 (.00)  0.01 (.00)  0.01 (.00)  0.01 (.00)


Table 3: Results of the real data experiments (discussed in Section 4.3). Standard deviations are listed in 
parentheses. All scores are averaged over 20 runs. Bold scores are statistically tied with the best score in each 
row. 



Data                  Tr%  RESCAL      MMMF        MMMF+       BPTF        MrWTD-Q     MrWTD-H     MrWTD-L

Kinship (AUPRC)         3  0.08 (.00)  0.08 (.00)  0.08 (.00)  0.10 (.03)  0.10 (.04)  0.09 (.01)  0.11 (.03)
                        5  0.08 (.01)  0.09 (.00)  0.09 (.00)  0.26 (.03)  0.33 (.02)  0.31 (.04)  0.33 (.02)
                       10  0.13 (.01)  0.11 (.00)  0.15 (.01)  0.49 (.02)  0.44 (.13)  0.46 (.03)  0.48 (.02)
                       15  0.18 (.01)  0.14 (.01)  0.21 (.01)  0.57 (.01)  0.61 (.01)  0.42 (.24)  0.57 (.12)
                       20  0.28 (.01)  0.16 (.01)  0.27 (.01)  0.59 (.01)  0.65 (.02)  0.65 (.02)  0.63 (.01)
                       25  0.34 (.01)  0.18 (.00)  0.33 (.02)  0.61 (.01)  0.68 (.01)  0.70 (.01)  0.69 (.01)

Reality (AUPRC)         3  0.09 (.01)  0.09 (.01)  0.09 (.01)  0.11 (.01)  0.09 (.01)  0.09 (.01)  0.09 (.01)
                        5  0.09 (.01)  0.12 (.08)  0.10 (.01)  0.13 (.02)  0.11 (.02)  0.11 (.02)  0.11 (.02)
                       10  0.10 (.01)  0.13 (.03)  0.13 (.02)  0.20 (.03)  0.17 (.03)  0.17 (.03)  0.18 (.04)
                       15  0.11 (.01)  0.19 (.03)  0.17 (.03)  0.24 (.03)  0.27 (.06)  0.23 (.06)  0.23 (.08)
                       20  0.13 (.01)  0.22 (.06)  0.21 (.04)  0.29 (.03)  0.32 (.04)  0.27 (.06)  0.29 (.08)
                       25  0.14 (.01)  0.26 (.03)  0.25 (.05)  0.31 (.04)  0.34 (.02)  0.31 (.05)  0.34 (.04)

Social (AUPRC)          3  0.34 (.03)  0.29 (.02)  0.30 (.01)  0.41 (.01)  0.32 (.01)  0.32 (.01)  0.32 (.01)
                        5  0.35 (.03)  0.30 (.01)  0.31 (.01)  0.45 (.02)  0.35 (.03)  0.36 (.04)  0.36 (.03)
                       10  0.36 (.01)  0.33 (.01)  0.36 (.01)  0.51 (.01)  0.50 (.02)  0.47 (.03)  0.43 (.06)
                       15  0.45 (.01)  0.38 (.04)  0.41 (.02)  0.56 (.01)  0.54 (.02)  0.54 (.02)  0.54 (.03)
                       20  0.52 (.02)  0.41 (.02)  0.44 (.02)  0.60 (.02)  0.57 (.01)  0.57 (.02)  0.58 (.02)
                       25  0.56 (.01)  0.44 (.01)  0.47 (.01)  0.63 (.01)  0.60 (.01)  0.60 (.01)  0.62 (.01)



